Candace Savonen - CCDL for ALSF

This notebook sets up the MAF data files as a combined file for ready comparison in 02-analyze-concordance.Rmd and also does some first line analyses from ready-made maftools functions.

It is the first notebook in this series which addresses issue # 30 in OpenPBTA.

Summary of the Set Up:

For both the Strelka2 and MuTect2 datasets, the data is imported by maftools::read.maf and the corresponding clinical data from pbta-histologies.tsv is added to the object. This is only done once as written (as the read.maf is very memory intensive) and each MuTect2 and Strelka2 are saved to an RDS file for faster and ready reloading.

For both Strelka2 and MuTect2 data, the following variables are calculated and added into the combined dataset for analysis in 02-analyze-concordance.Rmd.

Variables created:

  • vaf : Variant Allele Frequency = (t_alt_count) / (t_ref_count + t_alt_count).
  • mutation_id: Used to determine whether two variants calls are identical in both datasets. It is a concatenation of: Hugo_Symbol, change, Start_Position, and Tumor_Sample_Barcode (the sample ID).
  • base_change: variable that indicates the exact change in bases (e.g. ‘T>C’).
  • change: indicates the base_change information but groups together deletions, insertions, and long (more than a SNV) as their own groups.
  • coding: Summarize the BIOTYPE variable for whether or not it is a coding gene.

The output files from this notebook:

  • scratch/strelka2.RDS
  • scratch/mutect2.RDS
  • scratch/metadata_filtered_maf_samples.tsv
  • analyses/mutect2-vs-strelka2/plots/sample_cor_mutect2_vs_strelka2.pdf
  • analyses/mutect2-vs-strelka2/plots/sample_cor_mutect2_vs_strelka2.pdf
  • analyses/mutect2-vs-strelka2/results/combined_results.tsv

Usage

To run this from the command line, use:

Rscript -e "rmarkdown::render('analyses/mutect2-vs-strelka2/01-set-up.Rmd', 
                              clean = TRUE)"

This assumes you are in the top directory of the repository.

Set Up

# We need maftools - this will be added to the running Docker issue whenever it is up
if (!("maftools" %in% installed.packages())) {
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  BiocManager::install("maftools")
}

# Will need hexbin for the hex plot
if (!("hexbin" %in% installed.packages())) {
  install.packages("hexbin")
}

Get magrittr pipe

`%>%` <- dplyr::`%>%`

Directories and files

Path to the symlinked data obtained via bash download-data.sh.

data_dir <- file.path("..", "..", "data")
scratch_dir <- file.path("..", "..", "scratch")

Create output directories in this analysis folder.

if (!dir.exists("results")) {
  dir.create("results")
}
if (!dir.exists("plots")) {
  dir.create("plots")
}

Functions

Set up the function that will create the new variables from the maf objects.

Description of variables

The function, set_up_variables, will calculate the following variables from a maf object:
- Calculate VAF for each. The VAFs for each variation in each dataset are calculated by t_alt_count) / (t_ref_count + t_alt_count), as following the code used in maftools
- Create a base_change variable that indicates the exact change in bases.
- Create a change variable for each which indicates the base_change information but groups together deletions, insertions, and long (more than a SNV) as their own groups.
- Make a mutation id by concatenating Hugo_Symbol, change, Start_Position, and Tumor_Sample_Barcode (the sample ID).
We will use this variable to determine whether two variants calls are identical in both datasets. - Summarize the BIOTYPE variable for whether or not it is a coding gene.

set_up_variables <- function(maf = NULL) {
  # Creates these new variables from a maf object provided: VAF, mutation_id, 
  # base_change, change, coding.
  #
  # Args:
  #   maf: a maf object to create new variables from
  #
  # Returns:
  #   a data.frame with all the original information in the `@data` part of the 
  #   maf object but with these new variables: VAF, mutation_id, base_change, 
  #   change, coding.

  # Extract the data part of the maf object, put it through a dplyr pipe. 
  df <- maf@data
  df %>%
    dplyr::mutate(
      # Calculate the variant allele frequency
      vaf = as.numeric(t_alt_count) / (as.numeric(t_ref_count) +
                                         as.numeric(t_alt_count)),
      # Create a base_change variable
      base_change = paste0(Reference_Allele, ">", Allele),
      
      # Create a variable that notes whether the variant is in a 
      # coding/non-coding region
      coding = dplyr::case_when(
        BIOTYPE != "protein_coding" ~ "non-coding",
        TRUE ~ "protein_coding"
      )
    ) %>%
    dplyr::mutate(
      # From the base_change variable, summarize insertions, deletions, and 
      # changes that are more than one base into their own groups.
      change = dplyr::case_when(
      grepl("^-", base_change) ~ "insertion",
      grepl("-$", base_change) ~ "deletion",
      nchar(base_change) > 3 ~ "long_change",
      TRUE ~ base_change
    )) %>%
    dplyr::mutate(
      # Create the mutation id based on the change variable as well as the 
      # gene symbol, start position, and sample ID. 
      mutation_id = paste0(
        Hugo_Symbol, "_",
        change, "_",
        Start_Position, "_",
        Tumor_Sample_Barcode
      )
    ) %>%
    # Get rid of any variables that have completely NAs.
    dplyr::select(-which(apply(is.na(.), 2, all)))
}

Summarize and compare maf objects function

This function uses the maftools summary functions, maftools::getGeneSummary or maftools::getSampleSummary, to summarize and compare two maf objects. These maftools summary functions obtain the number of variants of each classification type on either a gene or sample level (this is taked from the information in the Variant_Classification field in the original maf file.) This field has the translation effect of the variant, e.g. “Frame_Shift_Del”.

correlate_maf_summaries <- function(maf1 = NULL, maf2 = NULL, 
                                    maf1_name = "maf1", maf2_name = "maf2", 
                                    summarize_by = "gene") {
  # Takes two maf files, summarizes them by gene or sample, combines them by 
  # a full join, and correlates the summaries using Pearson's and Spearman's
  # correlations. 
  #
  # Args:
  #   maf1: a first maf object to summarize and compare to maf2
  #   maf2: a second maf object to summarize and compare to maf1
  #   maf1_name: a character string that indicates what label you would like to 
  #   use for maf1's data. 
  #   maf2_name: a character string that indicates what label you would like to 
  #   use for maf2's data. 
  #   summarize_by: a single character string signifying either "gene" or "sample"
  #       This will indicate whether to summarize compare by gene or sample. 
  #       "gene" will summarize both maf1 and maf2 using maftools::getGeneSummary
  #       "sample" will summarize both maf1 and maf2 using maftools::getSampleSummary
  # Returns:
  #   a data.frame that contains the correlation r and p values for both 
  #   maf1 and maf2 based on the numbers of variants of each class.
  
  # Summarize the maf objects by the specified argument
  if (summarize_by == "gene") {
    maf1_sum <- maftools::getGeneSummary(maf1)
    maf2_sum <- maftools::getGeneSummary(maf2)
    key <- "Hugo_Symbol"
  }
  if (summarize_by == "sample") {
    maf1_sum <- maftools::getSampleSummary(maf1)
    maf2_sum <- maftools::getSampleSummary(maf2)
    key <- "Tumor_Sample_Barcode"
  }
  
  # Do a full join of both summaries.
  maf1_sum %>%
    dplyr::full_join(maf2_sum, by = key) %>%
    # Melt this data.frame so we can make it long format
    reshape2::melt(id = key) %>%
    dplyr::mutate(dataset = as.character(grepl(".x$", variable))) %>%
    # Make a new column that specifies what maf object the data is from
    dplyr::mutate(dataset = dplyr::recode(dataset,
      `TRUE` = maf1_name,
      `FALSE` = maf2_name,
    )) %>%
    # Gets rid of the ".x" and ".y" specifications of the column names
    dplyr::mutate(variable = gsub(".x$|.y$", "", variable)) %>%
    # Spreads the data based on the new dataset variable
    tidyr::spread("dataset", "value") %>%  
    
    # Get correlations by each type of variant classification
    dplyr::group_by(variable) %>%
    dplyr::summarize(pearson.cor = cor.test(strelka2, mutect2, method = "pearson")$estimate,
                     pearson.pval = cor.test(strelka2, mutect2, method = "pearson")$p.value,
                     spearman.cor = cor.test(strelka2, mutect2, method = "spearman")$estimate,
                     spearman.pval = cor.test(strelka2, mutect2, method = "pearson")$p.value) 
}

Read in the metadata information

Running maftools::read.maf takes a lot of computing power and time, so to avoid having to run this for both datasets everytime we want to re-run this notebook or the analyses in the other notebook, I’ve set this up to save the MAF objects as RDS files.

First let’s establish the file paths.

# File paths for the needed files for this analysis
metadata_dir <- file.path(scratch_dir, "metadata_filtered_maf_samples.tsv")
strelka2_dir <- file.path(scratch_dir, "strelka2.RDS")
mutect2_dir <- file.path(scratch_dir, "mutect2.RDS")

Read in the Strelka2 and Mutect2 data

We will read in the data as maftools objects from an RDS file, unless maftools has not been run on them yet. We will use metadata as the clinicalData for the maftools object.

Note: If you trying to run the initial set up step in a Docker container, it will likely be out of memory killed, unless you have ~50GB you can allocate to Docker.

# Get a vector of whether these exist
files_needed <- file.exists(metadata_dir, strelka2_dir, mutect2_dir)

if (all(files_needed)) {
  # Read the ready-to-go files if these files exist
  metadata <- metadata <- readr::read_tsv(metadata_dir)
  strelka2 <- readRDS(strelka2_dir)
  mutect2 <- readRDS(mutect2_dir)
} else { # If any of the needed files don't exist, rerun this process:
  # Only import the sample names
  strelka2_samples <- data.table::fread(file.path(
    data_dir,
    "pbta-snv-strelka2.vep.maf.gz"
  ),
  select = "Tumor_Sample_Barcode",
  skip = 1,
  data.table = FALSE
  ) %>%
    dplyr::pull("Tumor_Sample_Barcode")

  mutect2_samples <- data.table::fread(file.path(
    data_dir,
    "pbta-snv-mutect2.vep.maf.gz"
  ),
  select = "Tumor_Sample_Barcode",
  skip = 1,
  data.table = FALSE
  ) %>%
    dplyr::pull("Tumor_Sample_Barcode")

  # Isolate metadata to only the samples that are in the datasets
  metadata <- readr::read_tsv(file.path(data_dir, "pbta-histologies.tsv")) %>%
    dplyr::filter(Kids_First_Biospecimen_ID %in% c(strelka2_samples, mutect2_samples)) %>%
    dplyr::distinct(Kids_First_Biospecimen_ID, .keep_all = TRUE) %>%
    dplyr::arrange() %>%
    dplyr::rename(Tumor_Sample_Barcode = Kids_First_Biospecimen_ID) %>%
    readr::write_tsv(file.path(scratch_dir, "metadata_filtered_maf_samples.tsv"))

  # Read in original strelka file with maftools
  strelka2 <- maftools::read.maf(file.path(data_dir, "pbta-snv-strelka2.vep.maf.gz"),
    clinicalData = metadata
  )

  # Save to RDS so we don't have to run this again
  saveRDS(strelka2, strelka2_dir)

  # Same for MuTect2
  mutect2 <- maftools::read.maf(file.path(data_dir, "pbta-snv-mutect2.vep.maf.gz"),
    clinicalData = metadata
  )
  saveRDS(mutect2, mutect2_dir)
}
Parsed with column specification:
cols(
  .default = col_character(),
  age_at_diagnosis = col_double(),
  molecular_subtype = col_logical()
)
See spec(...) for full column specifications.

Compare number of mutations per gene

gene_cors <- correlate_maf_summaries(maf1 = strelka2, 
                                     maf2 = mutect2, 
                                     maf1_name = "strelka2", 
                                     maf2_name = "mutect2",
                                     summarize_by = "gene")
Cannot compute exact p-value with tiesCannot compute exact p-value with tiesCannot compute exact p-value with tiesCannot compute exact p-value with tiesCannot compute exact p-value with tiesCannot compute exact p-value with tiesCannot compute exact p-value with tiesCannot compute exact p-value with tiesCannot compute exact p-value with tiesCannot compute exact p-value with tiesCannot compute exact p-value with tiesCannot compute exact p-value with ties
# Print out this data.frame
gene_cors %>%
  ggplot2::ggplot(ggplot2::aes(x = variable, y = pearson.cor)) +
  ggplot2::geom_bar(stat = "identity") +
  ggplot2::theme_classic() +
  ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 90, hjust = 1))  

Compare number of mutations per sample

sample_cors <- correlate_maf_summaries(maf1 = strelka2, 
                                       maf2 = mutect2, 
                                       maf1_name = "strelka2", 
                                       maf2_name = "mutect2",
                                       summarize_by = "sample")
Cannot compute exact p-value with tiesCannot compute exact p-value with tiesCannot compute exact p-value with tiesCannot compute exact p-value with tiesCannot compute exact p-value with tiesCannot compute exact p-value with tiesCannot compute exact p-value with tiesCannot compute exact p-value with tiesCannot compute exact p-value with tiesCannot compute exact p-value with ties
# Print out this data.frame
sample_cors %>%
  ggplot2::ggplot(ggplot2::aes(x = variable, y = pearson.cor)) +
  ggplot2::geom_bar(stat = "identity") +
  ggplot2::theme_classic() +
  ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 90, hjust = 1))  

Plot Transition/Transversions

These built in maftools functions plot the overall presence of transitions and transversions and is recommended by the maftools vignette as a method of obtaining an overall base change summary.

Strelka2 transition/transversion plot.

maftools::plotTiTv(maftools::titv(strelka2))

MuTect2 transition/transversion plot.

maftools::plotTiTv(maftools::titv(mutect2))

Set up new variables

Let’s set up Strelka2’s variables first.

# Use the premade function to create our new variables
strelka2_vaf <- set_up_variables(strelka2)
NAs introduced by coercion
# Take a look at this df
strelka2_vaf

Now we will do the same for MuTect2.

# Use the premade function to create our new variables
mutect2_vaf <- set_up_variables(mutect2)

# Take a look at this df
mutect2_vaf

Combine MuTect2 and Strelka2 data.frames into one data.frame

Join these datasets based on the mutation_id created by the set_up_variables function. Save to a TSV file to be used in the subsequent notebook.

# Merge these data.frames together
vaf_df <- strelka2_vaf %>%
  dplyr::full_join(mutect2_vaf,
    by = "mutation_id",
    suffix = c(".strelka2", ".mutect2")
  ) %>%
  # Make a variable that denotes which dataset it is in.
  dplyr::mutate(dataset = dplyr::case_when(
    is.na(Allele.mutect2) ~ "strelka2_only",
    is.na(Allele.strelka2) ~ "mutect2_only",
    TRUE ~ "both"
  )) %>%
  readr::write_tsv(file.path("results", "combined_results.tsv"))

Make a zipped up version that can be stored on GitHub.

zip(
  file.path("results", "combined_results.tsv.zip"),
  file.path("results", "combined_results.tsv")
)
updating: results/combined_results.tsv (deflated 85%)

Session Info:

sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.5

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Biobase_2.44.0      BiocGenerics_0.30.0

loaded via a namespace (and not attached):
 [1] pkgload_1.0.2      tidyr_0.8.3        jsonlite_1.6       splines_3.6.1      foreach_1.4.7      assertthat_0.2.1   BiocManager_1.30.4
 [8] yaml_2.2.0         remotes_2.1.0      sessioninfo_1.1.1  pillar_1.4.2       backports_1.1.4    lattice_0.20-38    glue_1.3.1        
[15] digest_0.6.20      RColorBrewer_1.1-2 colorspace_1.4-1   htmltools_0.3.6    Matrix_1.2-17      plyr_1.8.4         pkgconfig_2.0.2   
[22] devtools_2.1.0     bibtex_0.4.2       purrr_0.3.2        xtable_1.8-4       scales_1.0.0       processx_3.4.1     tibble_2.1.3      
[29] pkgmaker_0.27      ggplot2_3.2.1      usethis_1.5.1      withr_2.1.2        hexbin_1.27.3      lazyeval_0.2.2     cli_1.1.0         
[36] survival_2.44-1.1  magrittr_1.5       crayon_1.3.4       memoise_1.1.0      evaluate_0.14      ps_1.3.0           fs_1.3.1          
[43] doParallel_1.0.15  NMF_0.21.0         pkgbuild_1.0.4     tools_3.6.1        registry_0.5-1     data.table_1.12.2  prettyunits_1.0.2 
[50] hms_0.5.0          gridBase_0.4-7     stringr_1.4.0      munsell_0.5.0      cluster_2.1.0      rngtools_1.4       maftools_2.0.15   
[57] callr_3.3.1        compiler_3.6.1     rlang_0.4.0        grid_3.6.1         iterators_1.0.12   rstudioapi_0.10    base64enc_0.1-3   
[64] labeling_0.3       rmarkdown_1.14     testthat_2.2.1     gtable_0.3.0       codetools_0.2-16   reshape2_1.4.3     R6_2.4.0          
[71] knitr_1.24         dplyr_0.8.3        zeallot_0.1.0      rprojroot_1.3-2    readr_1.3.1        desc_1.2.0         stringi_1.4.3     
[78] Rcpp_1.0.2         vctrs_0.2.0        wordcloud_2.6      tidyselect_0.2.5   xfun_0.8          
---
title: "Set up combined data of Mutect2 and Strelka2"
output:   
  html_notebook: 
    toc: true
    toc_float: true
---

Candace Savonen - CCDL for ALSF

This notebook sets up the [MAF data files as a combined file for ready comparison in 02-analyze-concordance.Rmd](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/mutect2-vs-strelka2/02-analyze-concordance.Rmd)
and also does some first line analyses from ready-made `maftools` functions.

It is the first notebook in this series which addresses [issue \# 30 in OpenPBTA](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/30).

### Summary of the Set Up:  

For both the Strelka2 and MuTect2 datasets, the data is imported by 
`maftools::read.maf` and the corresponding clinical data from 
`pbta-histologies.tsv` is added to the object. 
This is only done once as written (as the `read.maf` is very memory intensive)
and each MuTect2 and Strelka2 are saved to an RDS file for faster and ready 
reloading.

For both Strelka2 and MuTect2 data, the following variables are calculated
and added into the combined dataset for analysis in [02-analyze-concordance.Rmd](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/mutect2-vs-strelka2/02-analyze-concordance.Rmd).

#### Variables created:  

- `vaf` : Variant Allele Frequency = `(t_alt_count) / (t_ref_count + t_alt_count)`.  
- `mutation_id`: Used to determine whether two variants calls are identical 
in both datasets. 
It is a concatenation of: `Hugo_Symbol`, `change`, `Start_Position`, and 
`Tumor_Sample_Barcode` (the sample ID).   
- `base_change`: variable that indicates the exact change in bases (e.g. 'T>C').  
- `change`: indicates the `base_change` information but groups together 
deletions, insertions, and long (more than a SNV) as their own groups.  
- `coding`: Summarize the `BIOTYPE` variable for whether or not it is a coding gene.  

#### The output files from this notebook:

- `scratch/strelka2.RDS`  
- `scratch/mutect2.RDS`  
- `scratch/metadata_filtered_maf_samples.tsv`  
- `analyses/mutect2-vs-strelka2/plots/sample_cor_mutect2_vs_strelka2.pdf`
- `analyses/mutect2-vs-strelka2/plots/sample_cor_mutect2_vs_strelka2.pdf`
- `analyses/mutect2-vs-strelka2/results/combined_results.tsv`  

## Usage

To run this from the command line, use:
```
Rscript -e "rmarkdown::render('analyses/mutect2-vs-strelka2/01-set-up.Rmd', 
                              clean = TRUE)"
```
 _This assumes you are in the top directory of the repository._

## Set Up

```{r}
# We need maftools - this will be added to the running Docker issue whenever it is up
if (!("maftools" %in% installed.packages())) {
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  BiocManager::install("maftools")
}

# Will need hexbin for the hex plot
if (!("hexbin" %in% installed.packages())) {
  install.packages("hexbin")
}
```

Get `magrittr` pipe

```{r}
`%>%` <- dplyr::`%>%`
```

### Directories and files

Path to the symlinked data obtained via `bash download-data.sh`.

```{r}
data_dir <- file.path("..", "..", "data")
scratch_dir <- file.path("..", "..", "scratch")
```

Create output directories in this analysis folder.

```{r}
if (!dir.exists("results")) {
  dir.create("results")
}
if (!dir.exists("plots")) {
  dir.create("plots")
}
```

### Functions

Set up the function that will create the new variables from the maf objects. 

#### Description of variables 
The function, `set_up_variables`, will calculate the following variables from a maf object:  
- Calculate VAF for each. 
The VAFs for each variation in each dataset are calculated by 
`t_alt_count) / (t_ref_count + t_alt_count) `, as following the [code used in 
`maftools`](https://github.com/PoisonAlien/maftools/blob/1d0270e35c2e0f49309eba08b62343ac0db10560/R/plot_vaf.R#L39)  
- Create a `base_change` variable that indicates the exact change in bases.  
- Create a `change` variable for each which indicates the `base_change`
information but groups together deletions, insertions, and long (more than a 
SNV) as their own groups.  
- Make a mutation id by concatenating `Hugo_Symbol`, `change`, `Start_Position`,
and `Tumor_Sample_Barcode` (the sample ID).   
We will use this variable to determine whether two variants calls are identical 
in both datasets. 
- Summarize the `BIOTYPE` variable for whether or not it is a coding gene.   

```{r}
set_up_variables <- function(maf = NULL) {
  # Creates these new variables from a maf object provided: VAF, mutation_id, 
  # base_change, change, coding.
  #
  # Args:
  #   maf: a maf object to create new variables from
  #
  # Returns:
  #   a data.frame with all the original information in the `@data` part of the 
  #   maf object but with these new variables: VAF, mutation_id, base_change, 
  #   change, coding.

  # Extract the data part of the maf object, put it through a dplyr pipe. 
  df <- maf@data
  df %>%
    dplyr::mutate(
      # Calculate the variant allele frequency
      vaf = as.numeric(t_alt_count) / (as.numeric(t_ref_count) +
                                         as.numeric(t_alt_count)),
      # Create a base_change variable
      base_change = paste0(Reference_Allele, ">", Allele),
      
      # Create a variable that notes whether the variant is in a 
      # coding/non-coding region
      coding = dplyr::case_when(
        BIOTYPE != "protein_coding" ~ "non-coding",
        TRUE ~ "protein_coding"
      )
    ) %>%
    dplyr::mutate(
      # From the base_change variable, summarize insertions, deletions, and 
      # changes that are more than one base into their own groups.
      change = dplyr::case_when(
      grepl("^-", base_change) ~ "insertion",
      grepl("-$", base_change) ~ "deletion",
      nchar(base_change) > 3 ~ "long_change",
      TRUE ~ base_change
    )) %>%
    dplyr::mutate(
      # Create the mutation id based on the change variable as well as the 
      # gene symbol, start position, and sample ID. 
      mutation_id = paste0(
        Hugo_Symbol, "_",
        change, "_",
        Start_Position, "_",
        Tumor_Sample_Barcode
      )
    ) %>%
    # Get rid of any variables that have completely NAs.
    dplyr::select(-which(apply(is.na(.), 2, all)))
}
```

#### Summarize and compare `maf` objects function 

This function uses the maftools summary functions, `maftools::getGeneSummary` or
`maftools::getSampleSummary`, to summarize and compare two maf objects.
These `maftools` summary functions obtain the number of variants of each 
classification type on either a gene or sample level (this is taked from the 
information in the [`Variant_Classification` field in the original maf file](https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/).)
This field has the translation effect of the variant, e.g. "Frame_Shift_Del".

```{r}
correlate_maf_summaries <- function(maf1 = NULL, maf2 = NULL, 
                                    maf1_name = "maf1", maf2_name = "maf2", 
                                    summarize_by = "gene") {
  # Takes two maf files, summarizes them by gene or sample, combines them by 
  # a full join, and correlates the summaries using Pearson's and Spearman's
  # correlations. 
  #
  # Args:
  #   maf1: a first maf object to summarize and compare to maf2
  #   maf2: a second maf object to summarize and compare to maf1
  #   maf1_name: a character string that indicates what label you would like to 
  #   use for maf1's data. 
  #   maf2_name: a character string that indicates what label you would like to 
  #   use for maf2's data. 
  #   summarize_by: a single character string signifying either "gene" or "sample"
  #       This will indicate whether to summarize compare by gene or sample. 
  #       "gene" will summarize both maf1 and maf2 using maftools::getGeneSummary
  #       "sample" will summarize both maf1 and maf2 using maftools::getSampleSummary
  # Returns:
  #   a data.frame that contains the correlation r and p values for both 
  #   maf1 and maf2 based on the numbers of variants of each class.
  
  # Summarize the maf objects by the specified argument
  if (summarize_by == "gene") {
    maf1_sum <- maftools::getGeneSummary(maf1)
    maf2_sum <- maftools::getGeneSummary(maf2)
    key <- "Hugo_Symbol"
  }
  if (summarize_by == "sample") {
    maf1_sum <- maftools::getSampleSummary(maf1)
    maf2_sum <- maftools::getSampleSummary(maf2)
    key <- "Tumor_Sample_Barcode"
  }
  
  # Do a full join of both summaries.
  maf1_sum %>%
    dplyr::full_join(maf2_sum, by = key) %>%
    # Melt this data.frame so we can make it long format
    reshape2::melt(id = key) %>%
    dplyr::mutate(dataset = as.character(grepl(".x$", variable))) %>%
    # Make a new column that specifies what maf object the data is from
    dplyr::mutate(dataset = dplyr::recode(dataset,
      `TRUE` = maf1_name,
      `FALSE` = maf2_name,
    )) %>%
    # Gets rid of the ".x" and ".y" specifications of the column names
    dplyr::mutate(variable = gsub(".x$|.y$", "", variable)) %>%
    # Spreads the data based on the new dataset variable
    tidyr::spread("dataset", "value") %>%  
    
    # Get correlations by each type of variant classification
    dplyr::group_by(variable) %>%
    dplyr::summarize(pearson.cor = cor.test(strelka2, mutect2, method = "pearson")$estimate,
                     pearson.pval = cor.test(strelka2, mutect2, method = "pearson")$p.value,
                     spearman.cor = cor.test(strelka2, mutect2, method = "spearman")$estimate,
                     spearman.pval = cor.test(strelka2, mutect2, method = "pearson")$p.value) 
}
```


## Read in the metadata information

Running `maftools::read.maf` takes a lot of computing power and time, so to 
avoid having to run this for both datasets everytime we want to re-run this 
notebook or the analyses in the other notebook, I've set this up to save the 
`MAF` objects as `RDS` files.

First let's establish the file paths.

```{r}
# File paths for the needed files for this analysis
metadata_dir <- file.path(scratch_dir, "metadata_filtered_maf_samples.tsv")
strelka2_dir <- file.path(scratch_dir, "strelka2.RDS")
mutect2_dir <- file.path(scratch_dir, "mutect2.RDS")
```

## Read in the Strelka2 and Mutect2 data

We will read in the data as `maftools` objects from an RDS file, unless 
`maftools` has not been run on them yet.
We will use `metadata` as the `clinicalData` for the `maftools` object. 
 
Note: If you trying to run the initial set up step in a Docker container, it 
will likely be out of memory killed, unless you have ~50GB you can allocate to 
Docker. 

```{r}
# Get a vector of whether these exist
files_needed <- file.exists(metadata_dir, strelka2_dir, mutect2_dir)

if (all(files_needed)) {
  # Read the ready-to-go files if these files exist
  metadata <- metadata <- readr::read_tsv(metadata_dir)
  strelka2 <- readRDS(strelka2_dir)
  mutect2 <- readRDS(mutect2_dir)
} else { # If any of the needed files don't exist, rerun this process:
  # Only import the sample names
  strelka2_samples <- data.table::fread(file.path(
    data_dir,
    "pbta-snv-strelka2.vep.maf.gz"
  ),
  select = "Tumor_Sample_Barcode",
  skip = 1,
  data.table = FALSE
  ) %>%
    dplyr::pull("Tumor_Sample_Barcode")

  mutect2_samples <- data.table::fread(file.path(
    data_dir,
    "pbta-snv-mutect2.vep.maf.gz"
  ),
  select = "Tumor_Sample_Barcode",
  skip = 1,
  data.table = FALSE
  ) %>%
    dplyr::pull("Tumor_Sample_Barcode")

  # Isolate metadata to only the samples that are in the datasets
  metadata <- readr::read_tsv(file.path(data_dir, "pbta-histologies.tsv")) %>%
    dplyr::filter(Kids_First_Biospecimen_ID %in% c(strelka2_samples, mutect2_samples)) %>%
    dplyr::distinct(Kids_First_Biospecimen_ID, .keep_all = TRUE) %>%
    dplyr::arrange() %>%
    dplyr::rename(Tumor_Sample_Barcode = Kids_First_Biospecimen_ID) %>%
    readr::write_tsv(file.path(scratch_dir, "metadata_filtered_maf_samples.tsv"))

  # Read in original strelka file with maftools
  strelka2 <- maftools::read.maf(file.path(data_dir, "pbta-snv-strelka2.vep.maf.gz"),
    clinicalData = metadata
  )

  # Save to RDS so we don't have to run this again
  saveRDS(strelka2, strelka2_dir)

  # Same for MuTect2
  mutect2 <- maftools::read.maf(file.path(data_dir, "pbta-snv-mutect2.vep.maf.gz"),
    clinicalData = metadata
  )
  saveRDS(mutect2, mutect2_dir)
}
```

## Compare number of mutations per gene

```{r}
gene_cors <- correlate_maf_summaries(maf1 = strelka2, 
                                     maf2 = mutect2, 
                                     maf1_name = "strelka2", 
                                     maf2_name = "mutect2",
                                     summarize_by = "gene")

# Print out this data.frame
gene_cors %>%
  ggplot2::ggplot(ggplot2::aes(x = variable, y = pearson.cor)) +
  ggplot2::geom_bar(stat = "identity") +
  ggplot2::theme_classic() +
  ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 90, hjust = 1))  
```

## Compare number of mutations per sample

```{r}
sample_cors <- correlate_maf_summaries(maf1 = strelka2, 
                                       maf2 = mutect2, 
                                       maf1_name = "strelka2", 
                                       maf2_name = "mutect2",
                                       summarize_by = "sample")

# Print out this data.frame
sample_cors %>%
  ggplot2::ggplot(ggplot2::aes(x = variable, y = pearson.cor)) +
  ggplot2::geom_bar(stat = "identity") +
  ggplot2::theme_classic() +
  ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 90, hjust = 1))  
```

## Plot Transition/Transversions

These built in maftools functions plot the overall presence of transitions and 
transversions and is recommended by the [maftools vignette](https://bioconductor.org/packages/devel/bioc/vignettes/maftools/inst/doc/maftools.html#74_transition_and_transversions) as a method of obtaining an overall base change
summary. 

#### Strelka2 transition/transversion plot.

```{r}
maftools::plotTiTv(maftools::titv(strelka2))
```

#### MuTect2 transition/transversion plot.

```{r}
maftools::plotTiTv(maftools::titv(mutect2))
```

## Set up new variables

Let's set up Strelka2's variables first. 

```{r}
# Use the premade function to create our new variables
strelka2_vaf <- set_up_variables(strelka2)

# Take a look at this df
strelka2_vaf
```

Now we will do the same for MuTect2.

```{r}
# Use the premade function to create our new variables
mutect2_vaf <- set_up_variables(mutect2)

# Take a look at this df
mutect2_vaf
```

## Combine MuTect2 and Strelka2 data.frames into one data.frame

Join these datasets based on the `mutation_id` created by the `set_up_variables`
function. 
Save to a TSV file to be used in the subsequent notebook.

```{r}
# Merge these data.frames together
vaf_df <- strelka2_vaf %>%
  dplyr::full_join(mutect2_vaf,
    by = "mutation_id",
    suffix = c(".strelka2", ".mutect2")
  ) %>%
  # Make a variable that denotes which dataset it is in.
  dplyr::mutate(dataset = dplyr::case_when(
    is.na(Allele.mutect2) ~ "strelka2_only",
    is.na(Allele.strelka2) ~ "mutect2_only",
    TRUE ~ "both"
  )) %>%
  readr::write_tsv(file.path("results", "combined_results.tsv"))
```

Make a zipped up version that can be stored on GitHub.

```{r}
zip(
  file.path("results", "combined_results.tsv.zip"),
  file.path("results", "combined_results.tsv")
)
```

Session Info: 

```{r}
sessionInfo()
```
